Deterministic algorithms for sampling count data
Authors
Abstract
Processing and extracting meaningful knowledge from count data is an important problem in data mining. The volume of such data is growing dramatically because it is generated by day-to-day activities; market basket data, web clickstream data and network data are typical examples. Most mining and analysis algorithms require multiple passes over the data, which takes a great deal of time. One way to save time is to work on a sample, since a sample can serve as a surrogate for the data and the same sample can be used to answer many kinds of queries. In this paper, we propose two deterministic sampling algorithms, Biased-L2 and DRS. Both produce samples that are vastly superior, in sample quality and accuracy, to those of the previous deterministic and random algorithms. Our algorithms also improve on the run time and memory footprint of the existing deterministic algorithms. The new algorithms can sample from a relational database as well as from data streams: they examine each transaction only once and maintain the sample on the fly in a streaming fashion. We further show how to engineer one of our algorithms (DRS) to adapt to and recover from changes in the underlying data distribution or in the sample size. We evaluate our algorithms on three different synthetic datasets, as well as on real-world clickstream data, and demonstrate the improvements over prior art.
Preprint submitted to Elsevier 12 May 2007
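The one-pass setting described in the abstract, where each transaction is examined once and the sample is maintained on the fly, can be illustrated with a minimal sketch. This is not Biased-L2 or DRS, whose details the abstract does not give; it is plain random reservoir sampling (Algorithm R), one of the random baselines such methods are typically compared against. The L2 distance between item frequencies is an assumed quality measure chosen for illustration, and all function names are hypothetical.

```python
import random
from collections import Counter
from math import sqrt


def reservoir_sample(transactions, k, seed=0):
    """Keep a uniform random sample of k transactions in a single pass.

    Random baseline (Algorithm R), NOT the paper's deterministic methods.
    """
    rng = random.Random(seed)
    sample = []
    for i, t in enumerate(transactions):
        if i < k:
            # Fill the reservoir with the first k transactions.
            sample.append(t)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = t
    return sample


def item_frequencies(transactions):
    """Fraction of transactions that contain each item."""
    transactions = list(transactions)
    n = len(transactions)
    if n == 0:
        return {}
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c / n for item, c in counts.items()}


def l2_distance(freq_a, freq_b):
    """Euclidean distance between two item-frequency vectors.

    Assumed quality measure: smaller means the sample's item
    frequencies are closer to those of the full data.
    """
    items = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a.get(i, 0.0) - freq_b.get(i, 0.0)) ** 2 for i in items))


# Usage: sample a toy market-basket stream and measure how far the
# sample's item frequencies are from those of the full data.
data = [("a", "b"), ("a",), ("b", "c"), ("a", "c"), ("c",), ("a", "b", "c")]
sample = reservoir_sample(data, k=3)
error = l2_distance(item_frequencies(data), item_frequencies(sample))
```

A deterministic sampler in the spirit of the abstract would replace the random replacement step with a rule that chooses which transaction to keep or evict; the frequency comparison could stay the same.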
Similar resources
Efficient Data Reduction with EASE
A variety of mining and analysis problems — ranging from association-rule discovery to contingency table analysis to materialization of certain approximate datacubes — involve the extraction of knowledge from a set of categorical count data. Such data can be viewed as a collection of “transactions,” where a transaction is a fixed-length vector of counts. Classical algorithms for solving count-d...
ML-DS: A Novel Deterministic Sampling Algorithm for Association Rules Mining
Due to the explosive growth of data in every aspect of our life, data mining algorithms often suffer from scalability issues. One effective way to tackle this problem is to employ sampling techniques. This paper introduces ML-DS, a novel deterministic sampling algorithm for mining association rules in large datasets. Unlike most algorithms in the literature that use randomness in sampling, our...
Ridge Regression and Provable Deterministic Ridge Leverage Score Sampling
Ridge leverage scores provide a balance between low-rank approximation and regularization, and are ubiquitous in randomized linear algebra and machine learning. Deterministic algorithms are also of interest in the moderately big data regime, because deterministic algorithms provide interpretability to the practitioner by having no failure probability and always returning the same results. We pr...
Spatial count models on the number of unhealthy days in Tehran
Spatial count data are common in many sciences, such as environmental science, meteorology, geology and medicine. Spatial generalized linear models based on the Poisson (Poisson-lognormal spatial model) and binomial (binomial-logitnormal spatial model) distributions are often used to analyze discrete count data in which spatial correlation is observed. The likelihood function of these models i...
The Art of Data Augmentation
The term data augmentation refers to methods for constructing iterative optimization or sampling algorithms via the introduction of unobserved data or latent variables. For deterministic algorithms, the method was popularized in the general statistical community by the seminal article by Dempster, Laird, and Rubin on the EM algorithm for maximizing a likelihood function or, more generally, a pos...
Journal:
- Data Knowl. Eng.
Volume 64, Issue
Pages -
Publication date 2008